We present an axiomatic approach to the mean and discuss generalizations
of the mean, including one due to Kolmogorov based on
the Weak Law of Large Numbers. We offer examples and counterex- 
amples, describe conventional and unconventional uses of the mean 
in statistical mechanics, and resolve an anomaly in quantum theory 
concerning apparent simultaneous coexistence of means and variances 
of observables. These issues all arise from the familiar definition of 
the mean. 



1. Introduction. The most important number summarizing a data set 
is generally thought to be the mean. Some have questioned its utility, com- 
paring it unfavorably with the median, the mode, the midrange. Capitalists 
and communists used to argue over whether mean income or median income 
was the truer measure of citizen well-being. For another example, see Kosko 
[9]. The mean is not robust against outliers: it can be strongly influenced 
by a single observation. This is both a strength and a weakness. Kosko ob- 
jected that not only does a Cauchy random variable not have a well-defined 
mean but the average of independent identically distributed Cauchy ran- 
dom variables is itself a Cauchy variable with the same distribution and 
thus averaging does not reduce variability at all. 

Investigators often pursue the quest for a single number or a small set 
of numbers that capture the essence of a data set, make multiple data sets 
comparable, and provide order to the world of data sets. As data sets get 
larger and larger, thanks to the digital explosion, scrutiny of measures that 
compress data becomes more important. Candidates, in addition to those 
mentioned above, include entropy and various generalized means, but no 
one has arrived at measures clearly superior to the mean and its associated
measure, the root mean squared deviation or standard deviation. 

In work with sample data the mean is easy to understand, in contrast with 
other notions from probability theory - such as independence, conditional 
probability, and even probability itself. Some have argued (e.g., de Finetti 
[2], Pollard [12], and Whittle [14]) that the mean is the fundamental notion 
in probability theory and should occupy the central place in all treatments 
of probability. 

In this note we review some properties of the mean, consider some gener- 
alizations for cases when the ordinary mean does not exist, and investigate 
the significance of the mean in state space theory and quantum mechanics. 

We begin by axiomatizing the notion of sample mean. Along with familiar 
axioms for symmetry, homogeneity, and translation invariance, we introduce
a condensation axiom that describes the result of replacing arbitrary values 
by their sample mean. We then use the Strong Law of Large Numbers to 
arrive at the familiar mathematical notion of mean, $E(X)$.

Thereafter we consider generalizations of the mean. These are not needed 
for bounded or semi-bounded random variables, but really only for variables 
that have heavy-tailed distributions on both right and left, with tails of 
similar size. We consider what happens when a random variable is restricted 
to an interval $[c - M, c + M]$ and $M$ is allowed to tend to infinity. We
state a theorem (Theorem 3.1) describing the different kinds of behavior 
possible and provide examples of each. One generalization, which is due 
to Kolmogorov, is what we have chosen to call the weak mean, $E_w(X)$,
and corresponds precisely to validity of the Weak Law of Large Numbers.
Yet another generalization, the doubly weak mean, $E_{ww}(X)$, applies to the
Cauchy distribution. We also discuss multipliers that can be applied to a 
variable X to finitize the mean in the spirit of Feynman and note the dangers 
of such finitizations. Nonetheless, we recognize that attempts to scrutinize 
the notion of mean in connection with the Cauchy distribution and other 
long-tailed distributions are timely. 

Turning to applications, we point out that the mean is a natural tool in 
state space theory for the transition from deterministic models to statistical 
models. We discuss entropy and observe that although it is regarded as a 
mean it is very different from means arising from ordinary observables. We 
recall Jaynes' Maximum Entropy Principle, which seeks to maximize entropy 
subject to given values of conventional means. 

Lastly, we discuss the role the mean plays in quantum theory, and provide 
a precise answer to the question of when the mean and variance exist for a 
particular quantum state and a particular quantum observable. 

The conclusion, implicit in this discussion, is that the mean is the paramount
measure, of great and wide utility, instructive even when it falls short. There 
is little prospect of it losing its longtime preeminence. 

2. Axiomatics for the Sample Mean and the Strong Law of Large 
Numbers. Prior to introducing probability measures, let us consider
potential axioms for the mean of a finite set. In this setting, with
$\mathcal{R} = (-\infty, \infty)$, the mean can be thought of as a family of
functions $\{f_n\}$ for $n \ge 1$ with $f_n : \mathcal{R}^n \to \mathcal{R}$.
Its properties include the following:

M-1) (Homogeneity) $f_n(\lambda x_1, \ldots, \lambda x_n) = \lambda f_n(x_1, \ldots, x_n)$ for all
$(x_1, \ldots, x_n) \in \mathcal{R}^n$ and all $\lambda \in \mathcal{R}$;

M-2) (Symmetry) $f_n(x_1, \ldots, x_n) = f_n(x_{\sigma(1)}, \ldots, x_{\sigma(n)})$ for all
permutations $\sigma$ of the set $\{1, 2, \ldots, n\}$;

M-3) (Translation Invariance) $f_n(x_1 + c, \ldots, x_n + c) = f_n(x_1, \ldots, x_n) + c$
for all $(x_1, \ldots, x_n) \in \mathcal{R}^n$ and all $c \in \mathcal{R}$.

Other properties are the following: 

(Positive Homogeneity) $f_n(\lambda x_1, \ldots, \lambda x_n) = \lambda f_n(x_1, \ldots, x_n)$ for all
$(x_1, \ldots, x_n) \in \mathcal{R}^n$ and all $\lambda > 0$;

(Nonnegativity) If for some $(x_1, \ldots, x_n)$ and $(y_1, \ldots, y_n) \in \mathcal{R}^n$,
$x_1 \le y_1, \ldots, x_n \le y_n$, then $f_n(x_1, \ldots, x_n) \le f_n(y_1, \ldots, y_n)$;

(Positivity) If for some $(x_1, \ldots, x_n)$ and $(y_1, \ldots, y_n) \in \mathcal{R}^n$,
$x_1 \le y_1, \ldots, x_n \le y_n$, and $x_i < y_i$ for some $i$, then
$f_n(x_1, \ldots, x_n) < f_n(y_1, \ldots, y_n)$;

(Strict Positivity) If for some $(x_1, \ldots, x_n)$ and $(y_1, \ldots, y_n) \in \mathcal{R}^n$,
$x_i < y_i$ for all $i = 1, \ldots, n$, then $f_n(x_1, \ldots, x_n) < f_n(y_1, \ldots, y_n)$;

(Additivity) $f_n(x_1 + y_1, \ldots, x_n + y_n) = f_n(x_1, \ldots, x_n) + f_n(y_1, \ldots, y_n)$ for
all $(x_1, \ldots, x_n)$ and $(y_1, \ldots, y_n) \in \mathcal{R}^n$.

The above axioms seem reasonable except for additivity. The measure 
should be independent of units, thus homogeneous, and independent of the 
choice of zero point, and a function of the set rather than the ordered set. In 
addition, to capture characteristics of the data, ordering properties - nonnegativity
and perhaps positivity - are not unreasonable. However, additivity
asserts a relationship between the ordering of two data sets that survives 
reordering of one set, and this seems much too restrictive. 

Consider a rival measure to the mean, namely, the median. The median 
of a finite data set $\{x_1, \ldots, x_n\}$ is defined as the midmost of the numbers
when they are arranged in increasing order if n is odd, and half the sum of 
the two midmost numbers in such an arrangement if n is even. 

The median satisfies homogeneity, symmetry, translation invariance, and 
nonnegativity. Furthermore, any fixed convex combination of the mean and 
the median other than the median itself satisfies homogeneity, symmetry, 
translation invariance, nonnegativity, positivity, and strict positivity. Indeed, 
not only the median, but the maximum and the minimum of $\{x_1, \ldots, x_n\}$
(and other rank functions and convex combinations) satisfy positive
homogeneity, symmetry, translation invariance, and nonnegativity.
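
The claims above can be spot-checked numerically. The following is only a sketch on a single hypothetical data set, with arbitrarily chosen scale factor and shift; it uses Python's standard `statistics` module, and a passing check on one sample is of course not a proof of an axiom. The helper names `check_axioms` and `is_additive_on` are introduced here for illustration.

```python
from statistics import mean, median

def check_axioms(f, xs, lam=-2.5, c=7.0, tol=1e-12):
    """Spot-check homogeneity, symmetry and translation invariance of f on xs."""
    scaled = abs(f([lam * x for x in xs]) - lam * f(xs)) < tol
    permuted = abs(f(list(reversed(xs))) - f(xs)) < tol
    shifted = abs(f([x + c for x in xs]) - (f(xs) + c)) < tol
    return {"homogeneity": scaled, "symmetry": permuted, "translation": shifted}

def is_additive_on(f, xs, ys, tol=1e-12):
    """Check f(x + y) == f(x) + f(y) for one pair of data sets."""
    total = [x + y for x, y in zip(xs, ys)]
    return abs(f(total) - (f(xs) + f(ys))) < tol

# Hypothetical data: ys is a reordering of xs, which is what breaks
# additivity for the median.
xs = [0.0, 0.0, 3.0]
ys = [0.0, 3.0, 0.0]
print(check_axioms(mean, xs), check_axioms(median, xs))
print(is_additive_on(mean, xs, ys), is_additive_on(median, xs, ys))
```

On this sample both the mean and the median pass the three checks, while only the mean is additive.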

Proposition 2.1 Let $f_n : \mathcal{R}^n \to \mathcal{R}$ be a function satisfying homogeneity,
symmetry, and translation invariance.

1) If $n = 1$, then $f_1(x) = x$ for all $x \in \mathcal{R}$.

2) If $n = 2$, then $f_2(x_1, x_2) = (x_1 + x_2)/2$ for all $(x_1, x_2) \in \mathcal{R}^2$.

Proof : Homogeneity implies that $f_1(0) = f_2(0, 0) = 0$. Translation invariance
then indicates that $f_1(x) = f_1(0 + x) = f_1(0) + x = x$. When $n = 2$,

\[
f_2(a, b) = f_2\left(-\frac{b-a}{2} + \frac{a+b}{2},\ \frac{b-a}{2} + \frac{a+b}{2}\right)
= f_2\left(-\frac{b-a}{2},\ \frac{b-a}{2}\right) + \frac{a+b}{2}
= \frac{b-a}{2}\, f_2(-1, 1) + \frac{a+b}{2}.
\]

However, by homogeneity and symmetry $f_2(-1, 1) = -f_2(1, -1) = -f_2(-1, 1)$,
and so $f_2(-1, 1) = 0$.

When n = 1 or 2, the median and the mean coincide. However, it is 
obvious that they do not coincide in general when n is 3 or larger. Without 
the requirement of additivity it is natural to inquire whether there is another 
suitable property that will distinguish between the median and the mean. 
One property that we consider and reject is that $f_n$ shall have continuous
partial derivatives. 

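Proposition 2.2 Let $f_n : \mathcal{R}^n \to \mathcal{R}$ be a function satisfying homogeneity,
symmetry, and translation invariance that has partial derivatives at each
point with the partial derivatives continuous at $(0, \ldots, 0) \in \mathcal{R}^n$. Then

\[ f_n(x_1, \ldots, x_n) = \frac{x_1 + \cdots + x_n}{n} \]

for all $(x_1, \ldots, x_n) \in \mathcal{R}^n$.

Proof : If we differentiate the equation $f_n(\lambda x_1, \ldots, \lambda x_n) = \lambda f_n(x_1, \ldots, x_n)$
with respect to $x_i$, we obtain:

\[ \lambda \frac{\partial f_n}{\partial x_i}(\lambda x_1, \ldots, \lambda x_n) = \lambda \frac{\partial f_n}{\partial x_i}(x_1, \ldots, x_n). \]

Cancelling $\lambda$ from each side and taking a limit as $\lambda$ approaches 0, we obtain:

\[ \frac{\partial f_n}{\partial x_i}(0, \ldots, 0) = \frac{\partial f_n}{\partial x_i}(x_1, \ldots, x_n). \]

Thus all partial derivatives are constant. Since $f_n(0, \ldots, 0) = 0$ by
homogeneity, $f_n$ has the form:

\[ f_n(x_1, \ldots, x_n) = \alpha_1 x_1 + \cdots + \alpha_n x_n. \]

Symmetry now dictates that $\alpha_1 = \cdots = \alpha_n$, and the fact that $f_n(1, \ldots, 1) =
f_n(0, \ldots, 0) + 1 = 0 + 1 = 1$ accordingly implies that each $\alpha_i = 1/n$.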

The continuous differentiability assumption seems to be aimed primarily 
at elimination of the median. So we reject it. Instead we offer as an axiom 
a different property characteristic of the mean. 

M-4) (Condensation) For $n > m$,

\[ f_n(x_1, \ldots, x_n) = f_n(f_m(x_1, \ldots, x_m), \ldots, f_m(x_1, \ldots, x_m), x_{m+1}, \ldots, x_n) \]

for all $(x_1, \ldots, x_n) \in \mathcal{R}^n$.

This property asserts that if a subset of data is replaced by its "mean",
the grand "mean" is not changed. This is the first property that proposes a
definite relationship between means of sets of different sizes. In view of the
symmetry axiom (M-2), the statement does not really restrict the order of
the subset, and as we shall see shortly the statement is only really needed
in special cases.
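
To make the condensation property concrete, the sketch below replaces the first $m$ entries of a data set by their "mean" under a candidate function and compares the result with the original value. The data set and the helper names `condense` and `satisfies_condensation` are hypothetical, chosen only for illustration; the check covers one sample and one value of $m$.

```python
from statistics import mean, median

def condense(f, xs, m):
    """Replace the first m entries of xs by f of those entries, as in M-4."""
    head = f(xs[:m])
    return [head] * m + list(xs[m:])

def satisfies_condensation(f, xs, m, tol=1e-12):
    """Check f(xs) == f(condensed xs) for this particular xs and m."""
    return abs(f(xs) - f(condense(f, xs, m))) < tol

data = [1.0, 2.0, 9.0, 4.0, 5.0]  # hypothetical sample
print(satisfies_condensation(mean, data, 3))    # the grand mean is unchanged
print(satisfies_condensation(median, data, 3))  # the median changes on this sample
```

On this sample the arithmetic mean passes and the median fails, which is the sense in which M-4 separates the two.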

Proposition 2.3 Let $f_n : \mathcal{R}^n \to \mathcal{R}$ be a function satisfying homogeneity,
symmetry, translation invariance, and condensation, that is, M-1, M-2, M-3,
and M-4. Then

\[ f_n(x_1, \ldots, x_n) = \frac{x_1 + \cdots + x_n}{n} \]

for all $(x_1, \ldots, x_n) \in \mathcal{R}^n$.

Proof : In view of Proposition 2.1 we need only perform an inductive step
showing that the mean formula holds for $n \ge 3$ when it holds for $n - 1$.
Consider

\[
\begin{aligned}
f_n(x_1, \ldots, x_n)
&= f_n\left(\tfrac{1}{n-1}(x_1 + \cdots + x_{n-1}), \ldots, \tfrac{1}{n-1}(x_1 + \cdots + x_{n-1}), x_n\right) \\
&= f_n\left(0, \ldots, 0, x_n - \tfrac{1}{n-1}(x_1 + \cdots + x_{n-1})\right) + \tfrac{1}{n-1}(x_1 + \cdots + x_{n-1}) \\
&= \left(x_n - \tfrac{1}{n-1}(x_1 + \cdots + x_{n-1})\right) f_n(0, \ldots, 0, 1) + \tfrac{1}{n-1}(x_1 + \cdots + x_{n-1}) \\
&= \alpha_1 x_1 + \cdots + \alpha_n x_n.
\end{aligned}
\]

This shows that $f_n$ is a linear function of $x_1, \ldots, x_n$. It now follows from
Proposition 2.2 that it is the mean.

The proof of Proposition 2.3 requires that M-4 holds in the case when
$m = n - 1$. In fact we can get by with the assumption that M-4 holds when
$m = 2$. It is easy to see that in this case M-4 also holds for $m = 2^k$. Now
set $n = 2^k + j$, where $0 \le j < 2^k$. If $j = 0$, $f_n(x_1, \ldots, x_n) = f_n(c, \ldots, c) =
c f_n(1, \ldots, 1)$ where $c = (x_1 + \cdots + x_n)/n$. If $j > 0$, then $f_n(x_1, \ldots, x_n) =
f_n(c, \ldots, c, x_{m+1}, \ldots, x_n)$ where $m = 2^k$ and $c = (x_1 + \cdots + x_m)/m$. We now
replace $x_1, \ldots, x_m$ by $x'_1, \ldots, x'_{m-j}, 0, \ldots, 0$, so that $c = (x_1 + \cdots + x_m)/m =
(x'_1 + \cdots + x'_{m-j} + 0 + \cdots + 0)/m$. Using the symmetry and homogeneity
axioms, we obtain: $f_n(x_1, \ldots, x_n) = f_n((mc + x_{m+1} + \cdots + x_n)/m, \ldots, (mc +
x_{m+1} + \cdots + x_n)/m, 0, \ldots, 0) = ((mc + x_{m+1} + \cdots + x_n)/m) f_n(1, \ldots, 1, 0, \ldots, 0) =
((x_1 + \cdots + x_n)/m) f_n(1, \ldots, 1, 0, \ldots, 0)$. Thus we have established that $f_n$ is
linear in $x_1, \ldots, x_n$. By Proposition 2.2, $f_n$ is the ordinary mean.

imsart-aop ver. 2011/11/15 file: Fin2paper.tex date: October 16, 2012

A further note on axiomatics is that the translation invariance axiom can
be replaced by $f_n(1, \ldots, 1) = 1$ if we also assume that $f_2(x_1, x_2) = (x_1 + x_2)/2$.

To pass from the sample mean of a finite set to the usual general notion 
of mean, we introduce a real-valued random variable X. We suppose that 
associated with $X$ is a Borel probability measure $P_X$ taking each Borel
subset $A$ of the real numbers to:

$P_X(A)$ = the probability that $X$ belongs to the set $A$.

The mean of $X$, denoted by $E(X)$ or $\mu_X$, is defined when $x$ is integrable with
respect to $P_X$ to be:

\[ E(X) = \int_{\mathcal{R}} x \, P_X(dx). \]

One direction of the remarkable Strong Law of Large Numbers (see Pollard
[12, p. 78 and pp. 37-8]) states that if $\{X_n\}$ is a sequence of independent
random variables with common distribution $P_X$ and there exists a constant
$m$ such that

\[ \frac{X_1 + \cdots + X_n}{n} \ \text{converges almost surely to } m \]

as $n \to \infty$, then each $X_n$ has mean $m$. Here "almost surely" means outside
a set of measure zero in the countably infinite product space induced by
the measure $P_X$ (see [12, pp. 99-102]). More briefly, if sample means of
independent copies of $X$ settle down to something, then that something is
$E(X)$. This can be regarded as the motivation for the transition from the
sample mean to the mathematical mean $E(X)$. The general notion of mean
is derived from the finitary notion considered earlier.

The other direction of the Strong Law of Large Numbers asserts that if
$E(X)$ exists, then the sample mean of $n$ independent, identically distributed
copies of $X$ converges almost surely to $E(X)$ as $n$ tends to infinity. For a
proof of both directions of the Strong Law, see [12, pp. 95-102, p. 105]. For
an alternate proof due to N. Etemadi, see [12, pp. 106-7].

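
As an informal numerical illustration of the Strong Law (not part of the formal development), the sketch below compares sample means of uniform draws on $[0, 1]$, whose mean is $1/2$, with sample means of standard Cauchy draws, which have no mean. The seed, the sample sizes, and the helper names are arbitrary assumptions made for reproducibility.

```python
import math
import random

def sample_mean(sampler, n, rng):
    """Sample mean of n independent draws, each obtained as sampler(rng)."""
    return sum(sampler(rng) for _ in range(n)) / n

def uniform01(rng):
    return rng.random()

def standard_cauchy(rng):
    # Inverse-CDF sampling: tan(pi * (U - 1/2)) is standard Cauchy
    # when U is uniform on (0, 1).
    return math.tan(math.pi * (rng.random() - 0.5))

rng = random.Random(12345)  # fixed seed, chosen arbitrarily
for n in (100, 10_000, 100_000):
    print(n, sample_mean(uniform01, n, rng), sample_mean(standard_cauchy, n, rng))
```

The uniform column should settle near $1/2$ as $n$ grows, whereas the Cauchy column need not settle at all; a single run is only suggestive.
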
The transition here from finite samples to infinite populations distinguishes
the deductive method from the inductive method. While true science
deals comfortably with induction based on finite samples, the deductive
method of the Greeks (and Isaac Newton) relies on axioms whose relation- 
ship to reality is only approximate and always contingent. 

Indeed, in using the Strong Law of Large Numbers we are admittedly
introducing the full panoply of probability theory. It is possible, as noted in
the Introduction, to represent all of probability theory using the mean as
the primitive notion. Thus $P_X(A)$ can be defined as $E(\chi_A(X))$, the mean
of $\chi_A(X)$, where $\chi_A(X)$ is the random variable that equals 1 when $X$ is in
$A$ and 0 when $X$ is not in $A$. However, since in what follows we plan to use
probability theory in its conventional form (i.e., according to the axioms of
Kolmogorov [6]), we see no reason to restate measure-theoretic facts in terms
of the mean as primitive. Indeed, a reason not to do so is that the concept 
of independence, which is also fundamental in probability, is awkward when 
expressed exclusively in terms of means. 

3. Extending the Mean. When $x$ is not integrable with respect to $P_X$,
the notion $E(X)$ above is inapplicable and we must rely on other notions
of mean. Richard Feynman was famous for his integration tricks, and some 
of these are recorded in the book of Mathews and Walker [11], based on 
lectures Feynman gave at Cornell. Feynman's tricks partly motivated our 
investigation. 

Perhaps the most obvious generalization is the following:

\[ L(c) = \lim_{M \to \infty} \int_{[c-M,\, c+M]} x \, P_X(dx) \]

for a real number $c$.

By the Lebesgue Dominated Convergence Theorem this notion coincides
with the ordinary mean when $x$ is integrable with respect to $P_X$. Kolmogorov
[6, p. 40], in his great foundational work, noted this option in the case when
$c = 0$ and observed that it does not require integrability of $x$. Indeed if $X$
is a random variable obeying the Cauchy distribution $f(x) = 1/(\pi(1 + x^2))$,
then $X$ satisfies $L(c) = 0$ for any choice of $c$.
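
For the standard Cauchy density the integral of $x f(x)$ over $[c - M, c + M]$ has a closed form, since $\log(1 + x^2)/(2\pi)$ is an antiderivative of $x/(\pi(1 + x^2))$. The sketch below evaluates it for a few arbitrarily chosen values of $c$ and $M$; the function name is introduced here for illustration only.

```python
import math

def cauchy_truncated_integral(c, M):
    """Integral of x / (pi * (1 + x^2)) over [c - M, c + M].

    Uses the antiderivative log(1 + x^2) / (2 * pi).
    """
    hi, lo = c + M, c - M
    return (math.log1p(hi * hi) - math.log1p(lo * lo)) / (2.0 * math.pi)

for c in (0.0, 1.0, -5.0):
    print(c, [cauchy_truncated_integral(c, M) for M in (1e1, 1e3, 1e6)])
```

For each fixed $c$ the printed values shrink toward 0 as $M$ grows, roughly like $2c/(\pi M)$.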

We mention two related notions of mean:

L-1) $\lim_{M \to \infty} \int_{[a-M,\, b+M]} x \, P_X(dx)$, and

L-2) $\lim_{\min\{M,K\} \to \infty} \int_{[a-M,\, b+K]} x \, P_X(dx)$,

where $a < b$.

It is easily seen that L-1 coincides with $L((a + b)/2)$ since

\[ [a-M,\, b+M] = \left[\frac{a+b}{2} - \left(M + \frac{b-a}{2}\right),\ \frac{a+b}{2} + \left(M + \frac{b-a}{2}\right)\right]. \]

As for L-2, we have the following result. 



Proposition 3.1 Let $X$ be a random variable with probability measure $P_X$.
Then for some $a$ and $b$ with $a < b$

\[ \lim_{\min\{K,M\} \to \infty} \int_{[a-M,\, b+K]} x \, P_X(dx) \]

exists if and only if $x$ is integrable with respect to $P_X$.

Proof : If $x$ is integrable on $\mathcal{R}$, the limit exists and equals the mean of
$X$ by Lebesgue's Dominated Convergence Theorem. Conversely, if the limit
exists, then

\[ 0 \le \int_{(b+K,\, b+K']} x \, P_X(dx) < \epsilon \]

for $K < K'$, both sufficiently large, and any given $\epsilon > 0$. Likewise

\[ -\epsilon < \int_{[a-M',\, a-M)} x \, P_X(dx) \le 0 \]

for $M < M'$, both sufficiently large. Fatou's Lemma or Levi's Theorem [4,
p. 172] thus implies that

\[ 0 \le \int_{(b+K,\, \infty)} x \, P_X(dx) \le \epsilon, \qquad -\epsilon \le \int_{(-\infty,\, a-M)} x \, P_X(dx) \le 0 \]

and thus $x$ is integrable on $[0, \infty)$ as well as $(-\infty, 0]$ and so is integrable on
$\mathcal{R} = (-\infty, \infty)$.

So, when L-2 exists, it coincides with $E(X)$.

We now return to the study of $L(c)$. We shall allow $-\infty \le L(c) \le \infty$.
This gives us a bit more flexibility in characterizing what can happen. 

Lemma 3.1 Let $X$ be a random variable with probability measure $P_X$, and
let $c_1$ and $c_2$ be real numbers with $c_1 < c_2$. Then there are three possibilities:

i) If $L(c_1)$ exists in $[-\infty, \infty]$, then

\[ L(c_1) \le \liminf_{M \to \infty} \int_{[c_2-M,\, c_2+M]} x \, P_X(dx); \]

ii) If $L(c_2)$ exists in $[-\infty, \infty]$, then

\[ L(c_2) \ge \limsup_{M \to \infty} \int_{[c_1-M,\, c_1+M]} x \, P_X(dx); \]

iii) If $L(c_1)$ and $L(c_2)$ both exist in $[-\infty, \infty]$, then $L(c_1) = L(c_2)$.

Proof : Suppose $c_1 < c_2$. Then

\[
\int_{[c_2-M,\, c_2+M]} x \, P_X(dx) = \int_{[c_1-M,\, c_1+M]} x \, P_X(dx)
+ \int_{(c_1+M,\, c_2+M]} x \, P_X(dx) - \int_{[c_1-M,\, c_2-M)} x \, P_X(dx).
\]

The second and third terms on the right are both non-negative and accordingly
i) and ii) follow. In the case of iii), note that i) and ii) imply that if
both $L(c_1)$ and $L(c_2)$ exist, then $L(c_1) \le L(c_2)$.

If $L(c_2) - L(c_1) > 0$, then there is a positive constant $K$ (for example,
any positive number $< L(c_2) - L(c_1)$) such that for $M$ sufficiently large:

\[
\begin{aligned}
K &< \int_{(c_1+M,\, c_2+M]} x \, P_X(dx) - \int_{[c_1-M,\, c_2-M)} x \, P_X(dx) \\
&\le (c_2 + M) P_X((c_1 + M, c_2 + M]) + (M - c_1) P_X([c_1 - M, c_2 - M)) \\
&\le (M + d)\, P_X((c_1 + M, c_2 + M] \cup [c_1 - M, c_2 - M))
\end{aligned}
\]

where $d = \max\{|c_2|, |c_1|\}$. Thus

\[ \frac{K}{M + d} < P_X((c_1 + M, c_2 + M] \cup [c_1 - M, c_2 - M)). \]

Now replace $M$ by $M_j = M + j(c_2 - c_1)$ for each integer $j \ge 0$ to get:

\[ \frac{K}{M_j + d} < P_X((c_1 + M_j, c_2 + M_j] \cup [c_1 - M_j, c_2 - M_j)). \]

Summing over these inequalities and noting that $c_2 + M_j = c_1 + M_{j+1}$ and
$c_1 - M_j = c_2 - M_{j+1}$ we obtain:

\[ \infty = \sum_{j=0}^{\infty} \frac{K}{M + d + j(c_2 - c_1)} \le P_X((c_1 + M, \infty) \cup (-\infty, c_2 - M)) \le 1 \]

for a contradiction. Thus, this case is eliminated. So $L(c_1) = L(c_2)$.



Theorem 3.1 Let $X$ be a random variable with probability measure $P_X$.
Then exactly one of the following possibilities holds:

i) $L(c)$ does not exist in $[-\infty, \infty]$ for any real number $c$;

ii) $L(c)$ exists in $(-\infty, \infty)$ for exactly one real number $c$;

iii) $L(c)$ exists in $[-\infty, \infty]$ for all real numbers $c$ and is independent of $c$;

iv) there is a number $c_0$ such that $L(c) = \infty$ for $c > c_0$ and $L(c)$ does not
exist for $c < c_0$; or

v) there is a number $c_0$ such that $L(c) = -\infty$ for $c < c_0$ and $L(c)$ does
not exist for $c > c_0$.

Proof : By Lemma 3.1 it suffices to show what happens when $L(c_2) = L(c_1)$
is finite. In this case the last two terms in the equation at the beginning of
the proof of Lemma 3.1 each tend to 0 as $M$ tends to infinity. By a change
of variable, we obtain for the positive number $c = c_2 - c_1$:

\[ \lim_{M \to \infty} \int_{(M,\, c+M]} x \, P_X(dx) = \lim_{M \to \infty} \int_{[-M-c,\, -M)} x \, P_X(dx) = 0. \]

Assume $0 < d < c$ and $M > 0$. Then

\[ 0 \le \int_{(M,\, d+M]} x \, P_X(dx) \le \int_{(M,\, c+M]} x \, P_X(dx) \]

and

\[ \int_{[-M-c,\, -M)} x \, P_X(dx) \le \int_{[-M-d,\, -M)} x \, P_X(dx) \le 0. \]

Thus if ii) holds for $c$, it holds for $d$. On the other hand, if ii) holds for $c$ it
also holds for $nc$ where $n$ is any fixed positive integer since

\[ \int_{(M,\, nc+M]} x \, P_X(dx) = \sum_{j=1}^{n} \int_{((j-1)c+M,\, jc+M]} x \, P_X(dx) \]

and

\[ \int_{[-M-nc,\, -M)} x \, P_X(dx) = \sum_{j=1}^{n} \int_{[-M-(n+1-j)c,\, -M-(n-j)c)} x \, P_X(dx), \]

and if the $M$'s are chosen far enough out so that the individual integrals
are closer to zero than $\epsilon/n$, the sum integral is within $\epsilon$ of 0. Finally since
any positive real number $d$ is smaller than $nc$ for some positive integer $n$,
all cases are covered. Accordingly, ii) implies iii).

The argument for i) implies ii) can now be used to show that for any two
real numbers $c_1$ and $c_2$, if either $L(c_1)$ or $L(c_2)$ exists in $[-\infty, \infty]$, then the
other exists and equals it since the approximating integrals differ by two
integrals on intervals of length $|c_1 - c_2|$ that tend to zero as $M$ tends to
infinity. Thus iii) implies iv).

We give some examples to illustrate that each of the possibilities enumer- 
ated in Theorem 3.1 can occur. 

Consider a random variable $X$ whose probability measure is of the form

\[ P_X(A) = \sum_{n=1}^{\infty} \left( \frac{1}{2^{2n}} \delta_{2^{2n}}(A) + \frac{1}{2^{2n-1}} \delta_{-2^{2n-1}}(A) \right) \]

where $\delta_z$ is the (Dirac) probability measure whose value is 1 on any Borel
subset $A$ of $\mathcal{R}$ that contains the real number $z$ and whose value is zero
otherwise. The sum of the nonzero values is one, so this obviously defines a
probability measure. However the integral of $x$ over the interval $[c - M, c + M]$
is the difference between the size of the first set and the size of the second
set below:

The size of the set $\{n : 1 \le n,\ 2^{2n} \le c + M\}$, namely $\lfloor \log(c + M)/(2 \log 2) \rfloor$;

The size of the set $\{n : 1 \le n,\ 2^{2n-1} \le M - c\}$, namely $\lfloor \log(M - c)/(2 \log 2) + 1/2 \rfloor$,

where $\lfloor \cdot \rfloor$ is the floor function. For fixed $c$ and sufficiently large $M$ the
difference of the above quantities can assume the values 0 and $-1$ and the
integral does not settle down to either one. This is an instance of Theorem
3.1, part i).

A random variable $X$ can also be defined with probability measure of the
form

\[ P_X(A) = \sum_{n=1}^{\infty} \frac{1}{2^{n+1}} \left( \delta_{2^n}(A) + \delta_{-2^n}(A) \right). \]

So $P_X$ is concentrated at the points $\pm 2^n$ and assigns probability $1/2^{n+1}$
to such points. For this measure, $L(0)$ equals 0 by symmetry. However, $L(c)$
does not exist for other choices of $c$. If $c$ is positive, the integral of $x$ over the
closed interval $[c - M, c + M]$ reduces to its integral over the half-open interval
$(M - c, M + c]$ and this integral oscillates between 0 and $1/2$ for large $M$
depending on whether $2^n$ is in the interval $(M - c, M + c]$ or not. Similar
behavior occurs when $c < 0$. This example is an instance of Theorem 3.1,
part ii).
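
The oscillation in the example with atoms at $\pm 2^n$ of mass $1/2^{n+1}$ can be verified with exact rational arithmetic. The sketch below is illustrative only: the cutoff `n_max` is an assumption that is harmless as long as $|c| + M < 2^{n_{\max}}$, and the function name is introduced here.

```python
from fractions import Fraction

def truncated_integral(c, M, n_max=64):
    """Integral of x over [c - M, c + M] for atoms at +/-2^n with mass 1/2^(n+1).

    Only n = 1..n_max is scanned, which suffices when |c| + M < 2**n_max.
    """
    total = Fraction(0)
    for n in range(1, n_max + 1):
        atom, mass = 2 ** n, Fraction(1, 2 ** (n + 1))
        if c - M <= atom <= c + M:
            total += atom * mass   # each positive atom contributes +1/2
        if c - M <= -atom <= c + M:
            total -= atom * mass   # each negative atom contributes -1/2
    return total

for k in range(5, 9):
    # c = 1: the value depends on whether a power of 2 lies in (M - 1, M + 1].
    print(k, truncated_integral(1, 2 ** k), truncated_integral(1, 3 * 2 ** k))
print(truncated_integral(0, 1000))  # symmetric window about 0
```

With $c = 1$, windows of half-width $M = 2^k$ capture one extra positive atom, while windows with $M = 3 \cdot 2^k$ do not, so the truncated integrals never converge; with $c = 0$ they vanish identically.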

Now consider a random variable $X$ having a probability density (with
respect to Lebesgue measure on the real line) of the form

\[ f(x) = \begin{cases} \dfrac{C}{1 + |x|^a} & \text{if } x \ge 0, \\[1ex] \dfrac{D}{1 + |x|^b} & \text{if } x < 0, \end{cases} \]

where $a$ and $b$ are numbers in $(1, 2)$ and $C$ and $D$ are suitable positive
constants that guarantee that the density integrates to 1. Notice that this
random variable satisfies iii) of Proposition 3.1. It is easy to see that $L(c) =
\infty$ for all $c$ or $-\infty$ for all $c$ according as $b > a$ or $a > b$. The example
illustrates part iii) of Theorem 3.1 (as does the Cauchy distribution with
$L(c) = 0$).

The probability measure 

n=l 

illustrates part iv) of Theorem 3.1. If $c > 0$, the integral of $x$ over $[c - M, c + M]$
is given by:

\[ \sum_{\{n \,:\, 1 \le n,\ 3^n \le c + M\}} \frac{2^{n-1}}{3} \; - \sum_{\{n \,:\, 1 \le n,\ 3^n \le M - c\}} \frac{2^{n-2}}{3}, \]

and this expression has the value $(2^{n_0 - 1} - 2^{-1})/3$ or $(2^{n_0 - 1} + 2^{n_0 - 2} - 2^{-1})/3$
where $n_0 \sim \log(c + M)/\log 3$ for large $M$. Since $M$ and $n_0$ tend to infinity
together, it follows that $L(c) = \infty$ for $c > 0$.

On the other hand, if $c = -d$ where $d > 0$, the integral of $x$ over $[c - M, c + M] =
[-M - d, M - d]$ is given by:

\[ \sum_{\{n \,:\, 1 \le n,\ 3^n \le M - d\}} \frac{2^{n-1}}{3} \; - \sum_{\{n \,:\, 1 \le n,\ 3^n \le M + d\}} \frac{2^{n-2}}{3}, \]

and this reduces to $(2^{n_0 - 1} - 2^{-1})/3$ or to $(-1)/6$ for large $M$ where $n_0 \sim
\log(M - d)/\log 3$ depending on whether a positive integer lies in the interval
$(\log(M - d)/\log 3,\ \log(M + d)/\log 3]$ or not. Thus, $L(c)$ does not exist for $c <
0$.

A final example (also for part iv) of Theorem 3.1) is the case where the
probability measure is given by:

\[ P_X(A) = K \sum_{n=1}^{\infty} \left( \frac{2^n}{3^n + 1/n}\, \delta_{(3^n + 1/n)}(A) + \frac{2^{n-1}}{3^n}\, \delta_{-3^n}(A) \right). \]

Here $K$ is a suitably chosen positive normalizer, which is easily seen to be
smaller than 1/3. For $c > 0$, the integral of $x$ over $[c - M, c + M]$ is

\[ K \sum_{\{n \,:\, 1 \le n,\ (3^n + 1/n) \le M + c\}} 2^n \; - \; K \sum_{\{n \,:\, 1 \le n,\ 3^n \le M - c\}} 2^{n-1}, \]

and this reduces to $K(2^{n_0} - 1)$ for $M$ sufficiently large where $n_0$ is the largest
integer such that $3^{n_0} \le M - c$. Since $n_0$ and $M$ tend to infinity together,
$L(c) = \infty$ for all $c > 0$.

When $c = 0$, the integral of $x$ over $[-M, M]$ reduces to $K(2^{n_0} - 1)$ or to
$-K$ where $n_0$ is the largest integer such that $3^{n_0} + (1/n_0) \le M$ and the first
or second reduction occurs according as $M < 3^{n_0 + 1}$ or not. Thus $L(0)$ does
not exist. By Theorem 3.1 $L(c)$ does not exist for $c < 0$.

Other cases arising in Theorem 3.1, such as part v), are obtained by 
modifying the examples above, e.g., replacing X by —X or by X + a. 

4. Weak Means and Multipliers. One of the implications of Theorem 3.1
is that if $L(c)$ exists for more than one choice of $c$ and is finite in some
case then it is finite for all $c$ and is independent of $c$. The case of the Cauchy
distribution shows that this can happen without the ordinary mean existing.
Accordingly for a random variable $X$, we define the doubly weak mean of
$X$, denoted by $E_{ww}(X)$, to be the common value of $L(c)$ for all $c$ when this
common value exists and is in $(-\infty, \infty)$.

We also introduce an intermediate notion due to Kolmogorov between
the ordinary mean and the doubly weak mean that motivates our terminology.
The weak mean of $X$, denoted by $E_w(X)$, is defined as follows: $E_w(X)$
is the quantity $L(0)$ provided the latter exists in $(-\infty, \infty)$ and provided
$\lim_{n \to \infty} n P_X(|X| > n) = 0$.

The following proposition is due to Kolmogorov. It indicates that existence 
of the weak mean coincides precisely with the existence of a number for which 
the Weak Law of Large Numbers holds. 

Proposition 4.1 (Kolmogorov, 1928) Let $X$ be a random variable. Suppose
that $\{X_1, \ldots, X_n, \ldots\}$ are independent identically distributed copies of
$X$ with $P_n$ the $n$-fold product distribution. Then there is a real number $m$
such that for each $\epsilon > 0$

\[ \lim_{n \to \infty} P_n\left( \left| \frac{X_1 + \cdots + X_n}{n} - m \right| > \epsilon \right) = 0 \]

if and only if $X$ has weak mean $E_w(X) = m$.

Proof : See [6, p. 65], [7], and [8, Theorems XII and XIII].

Corollary 4.1 Let $X$ be a random variable.

i) If $X$ has a mean, then $X$ has a weak mean and $E(X) = E_w(X)$;
and

ii) if $X$ has a weak mean, then $X$ has a doubly weak mean and $E_w(X) = E_{ww}(X)$.

Proof : In case $X$ has a mean, then the identity function $x \mapsto x$ is integrable
with respect to the probability measure $P_X$ on the real line. In particular
the tail integrals

\[ \int_{(n,\, \infty)} x \, P_X(dx) \quad \text{and} \quad \int_{(-\infty,\, -n)} x \, P_X(dx) \]

tend to zero as $n$ tends to infinity. Since the absolute values of these integrals
are larger respectively than $n P_X(X > n)$ and $n P_X(X < -n)$, it follows that
$\lim_{n \to \infty} n P_X(|X| > n) = 0$. Likewise by Lebesgue's Dominated Convergence
Theorem, $L(0) = E(X)$.

Suppose $X$ has a weak mean. Then if $c_1 < c_2$ and
$\epsilon > 0$ and a sufficiently large $M$ are given,

\[
\begin{aligned}
0 \le \int_{(c_1+M,\, c_2+M]} x \, P_X(dx) &\le (c_2 + M) P_X((c_1 + M, c_2 + M]) \\
&\le (c_2 - c_1 + c_1 + M) P_X(|X| > c_1 + M) \\
&\le (c_2 - c_1 + c_1 + M) P_X(|X| > n) \\
&\le \frac{c_2 - c_1}{n}\, \epsilon + \frac{c_1 + M}{n}\, \epsilon
\end{aligned}
\]

where $n = \lfloor c_1 + M \rfloor$. For $M$ sufficiently large, the right side is as close to $\epsilon$
as we like. Thus

\[ \lim_{M \to \infty} \int_{(c_1+M,\, c_2+M]} x \, P_X(dx) = 0. \]

Similarly,

\[ \lim_{M \to \infty} \int_{[c_1-M,\, c_2-M)} x \, P_X(dx) = 0. \]

Accordingly from the first equation in the proof of Lemma 3.1, it follows
that when one of $L(c_2)$ or $L(c_1)$ exists and is finite, the other exists and is
equal to it. Since $L(0) = m$, it follows that $L(c)$ exists for all $c$, $L(c) = m$
and $m$ is the doubly weak mean of $X$.

Kolmogorov in [6, p. 66] gives an example where the Weak Law holds but 
the Strong Law does not. Cauchy random variables have L{c) existing for 
all c, independent of c, but violate the Weak Law by not decaying rapidly 
enough at infinity. Thus the mean, weak mean, and doubly weak mean are 
strictly distinct notions. 
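
The tail condition $\lim_{n \to \infty} n P_X(|X| > n) = 0$ in the definition of the weak mean can be examined in closed form for two familiar distributions. The sketch below is an illustration only: for the standard Cauchy distribution it uses $P(|X| > n) = (2/\pi)\arctan(1/n)$, and for a standard normal variable (included purely for contrast, not discussed in the text) it uses $P(|Z| > n) = \operatorname{erfc}(n/\sqrt{2})$.

```python
import math

def cauchy_tail_times_n(n):
    """n * P(|X| > n) for standard Cauchy X; P(|X| > n) = (2/pi) * atan(1/n)."""
    return n * (2.0 / math.pi) * math.atan(1.0 / n)

def normal_tail_times_n(n):
    """n * P(|Z| > n) for standard normal Z; P(|Z| > n) = erfc(n / sqrt(2))."""
    return n * math.erfc(n / math.sqrt(2.0))

for n in (1, 10, 100, 10_000):
    print(n, cauchy_tail_times_n(n), normal_tail_times_n(n))
```

The Cauchy column approaches $2/\pi$ rather than 0, which is the failure of the tail condition, while the normal column decays rapidly to 0.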

We make one more observation on generalizations of the mean, based 
on using multipliers to attempt to finitize the mean. These multipliers are 
a type of "mollifier." Usually mollifiers are used to aid approximation of 
the delta function and to smooth functions, but another use is to regularize 
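behavior at $\pm\infty$. The idea is to introduce a function $\phi_\lambda(x)$ that depends on a
parameter $\lambda$ so that $x \mapsto \phi_\lambda(x) x$ is integrable with respect to $P_X$ for $\lambda \ne \lambda_0$
and $\phi_\lambda(x) \to 1$ for a.e. $x$ as $\lambda \to \lambda_0$. In the case of $L(c)$, the multiplier can
be taken to be

\[ \phi_\lambda(x) = \chi_{[c - 1/\lambda,\ c + 1/\lambda]}(x) \]

where $\chi_A$ is the characteristic function of the set $A$ and $\lambda = 1/M$.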

Multipliers, and indeed other straightforward generalizations of the mean
including the weak and doubly weak mean, are useful only when the following
equations hold:

\[ \int_{[0,\, \infty)} x \, P_X(dx) = \infty \]

\[ \int_{(-\infty,\, 0]} x \, P_X(dx) = -\infty. \]

If neither of these equations holds, $x$ is integrable and the mean is
well-defined. If only the first equation holds, the mean is $+\infty$, and if only the
second equation holds, the mean is $-\infty$. If both equations hold, then there
is some room for maneuver. $L(c)$ cannot exist finitely unless the infinities
on each end are of the same order. If for example $P_X$ is given by a density
function $f$ with respect to Lebesgue measure such that $f(x)$ decays as $1/x^2$
as $x \to \infty$ and decays as $1/|x|^{3/2}$ as $x \to -\infty$, then $L(c) = -\infty$ for all $c$.

Multipliers offer possibilities for extending the notion of the mean. They 
can be of use in such activities as renormalization where the aim is to rein- 
terpret integrals to make them finite. In our case we set: 

\[ E_{mult}(X) = \lim_{\lambda \to \lambda_0} E(\phi_\lambda(X) X) \]

provided this limit exists. (This method is used to "evaluate" the integrals
of $\sin bx$ and $\sin x / x$ on $[0, \infty)$ in [11, p. 60 and p. 91].)

However, there are dangers that the following example illustrates. 

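Define a function $\phi_{\lambda,c}$ for $\lambda > 0$ and $c$ in $\mathcal{R}$ by:

\[ \phi_{\lambda,c}(x) = \begin{cases} e^{-\lambda x} & \text{if } x \ge 0, \\ e^{\lambda x}(1 + \pi c \lambda x) & \text{if } x < 0. \end{cases} \]

Here $c$ is an arbitrary constant. Evidently $\phi_{\lambda,c}$ is a well-behaved function,
integrable and dying off at $\pm\infty$. Also $\{\phi_{\lambda,c}\}$ converges pointwise to the
constant function one as $\lambda$ tends to $0^+$ with fixed $c$.

Suppose we use this family of functions as a multiplier to determine a
mean for a variable obeying the Cauchy distribution. Let $m(\lambda, c)$ be defined
by:

\[ m(\lambda, c) = \int_{-\infty}^{\infty} \phi_{\lambda,c}(x) \frac{x}{\pi(1 + x^2)} \, dx = \int_{-\infty}^{0} \frac{c \lambda e^{\lambda x} x^2}{1 + x^2} \, dx. \]

Now

\[
\begin{aligned}
1 = e^{\lambda x} \Big|_{-\infty}^{0} = \int_{-\infty}^{0} \lambda e^{\lambda x} \, dx
&\ge \int_{-\infty}^{0} \frac{\lambda e^{\lambda x} x^2}{1 + x^2} \, dx = m(\lambda, 1) = \int_{0}^{\infty} \frac{\lambda e^{-\lambda x} x^2}{1 + x^2} \, dx \\
&\ge \int_{K}^{\infty} \frac{\lambda e^{-\lambda x} x^2}{1 + x^2} \, dx \ge \frac{K^2 e^{-\lambda K}}{1 + K^2}
\end{aligned}
\]

for any positive real number $K$. Thus

\[ 1 \ge \limsup_{\lambda \to 0^+} m(\lambda, 1) \ge \liminf_{\lambda \to 0^+} m(\lambda, 1) \ge \frac{K^2}{1 + K^2}. \]

Letting $K$ tend to infinity, we find that $\lim_{\lambda \to 0^+} m(\lambda, 1) = 1$.

Hence the multiplier-induced mean of the standard Cauchy distribution
is:

\[
\lim_{\lambda \to 0^+} \int_{-\infty}^{\infty} \phi_{\lambda,c}(x) \frac{x}{\pi(1 + x^2)} \, dx = \lim_{\lambda \to 0^+} m(\lambda, c)
= \lim_{\lambda \to 0^+} c \, m(\lambda, 1) = c \lim_{\lambda \to 0^+} m(\lambda, 1) = c.
\]

However, $c$ was arbitrary depending on the choice of the multiplier!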
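
The dependence of the multiplier-induced Cauchy "mean" on the arbitrary constant $c$ can be seen numerically. The sketch below is a rough check, not a rigorous computation: it estimates $m(\lambda, 1) = \int_0^\infty \lambda e^{-\lambda x} x^2/(1 + x^2)\,dx$ by a midpoint rule after the substitution $u = \lambda x$, which gives $\int_0^\infty e^{-u} u^2/(\lambda^2 + u^2)\,du$. The truncation point, step count, and function names are assumptions made for illustration.

```python
import math

def m_lambda_1(lam, upper=40.0, steps=200_000):
    """Midpoint-rule estimate of m(lam, 1).

    Integrates e^{-u} u^2 / (lam^2 + u^2) over [0, upper]; the neglected
    tail beyond `upper` is smaller than e^{-upper}.
    """
    h = upper / steps
    total = 0.0
    for k in range(steps):
        u = (k + 0.5) * h
        total += math.exp(-u) * u * u / (lam * lam + u * u)
    return total * h

def multiplier_mean(lam, c):
    # m(lam, c) = c * m(lam, 1): the symmetric parts of the integrand cancel.
    return c * m_lambda_1(lam)

for lam in (1.0, 0.1, 0.01):
    print(lam, multiplier_mean(lam, 3.0))
```

With $c = 3$ the printed values increase toward 3 as $\lambda$ decreases; any other choice of $c$ would be approached in the same way.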

Although some may consider the Cauchy distribution anomalous, we remind
the reader that its legitimacy and importance stem in part from the fact
that it is the distribution of the quotient of two independent standard
normal random variables. It has application in physics under the name of
the Lorentz distribution. Indeed, long-tailed and counter-intuitive
distributions have become increasingly important in recent times (see
Gumbel [3] or Taleb [13]) in financial mathematics, the study of natural
and man-made disasters, and computer network analysis. Extending the notion
of mean to such distributions, and investigating the limits of the notion of
mean in such settings, are among the ways of moving beyond the normal regime.
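
The quotient characterization can be illustrated by simulation. The sketch
below (the function names, seed, and sample sizes are illustrative
assumptions) draws ratios of independent standard normals, compares the
sample median and the fraction of values in $(-1, 1)$ with the standard
Cauchy values $0$ and $1/2$, and prints running sample means, which need not
settle down as the sample grows.

```python
import random
import statistics


def cauchy_samples(n: int, seed: int = 0) -> list[float]:
    """Draw n values of Z1/Z2 with Z1, Z2 independent standard normals."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) / rng.gauss(0.0, 1.0) for _ in range(n)]


def running_means(xs: list[float], checkpoints: list[int]) -> list[float]:
    """Sample mean of the first k observations for each k in checkpoints."""
    return [sum(xs[:k]) / k for k in checkpoints]


if __name__ == "__main__":
    xs = cauchy_samples(100_000)
    inside = sum(1 for x in xs if abs(x) < 1.0) / len(xs)
    print("sample median:", statistics.median(xs))
    print("fraction with |x| < 1:", inside)
    print("running means:", running_means(xs, [100, 1_000, 10_000, 100_000]))
```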

5. State Space Theory. State Space Theory or System Theory is
widely used to provide a mathematical description of physical systems,
including those of classical mechanics, as well as other systems such as
biological and social systems. The state of the system at any time is taken
to be an element of a set $S$ called state space. The evolution of the state
is given by a function $T_t : S \to S$ taking the state $s$ at time $0$ to
the state $T_t(s)$ at time $t$. A (real-valued) observable is any function
$f : S \to \mathbb{R}$ which assigns to each state $s$ a number $f(s)$ (see
Mackey [10]). All observables may be determined from the state, and indeed
the state can be viewed as a maximal independent set of observables that
characterize the system at a given time. The dynamic evolution of the state
is deterministic and time may be taken to be either discrete or continuous.
Evolution of an observable $f$ can be expressed by $t \mapsto f \circ T_t(s)$,
i.e., the value of the observable at time $t$ is obtained by applying the
observable function to the state at time $t$.

A familiar example of the state space approach is Hamiltonian mechanics.
The state space in this case is phase space, and a state is a $2n$-tuple
$(q,p) = (q_1, \ldots, q_n, p_1, \ldots, p_n)$ consisting of position
coordinates $q_i$ and momentum coordinates $p_i$. The evolution is
$T_t(q,p) = (q(t),p(t))$, where the latter is the solution to Hamilton's
equations with initial data $(q(0),p(0)) = (q,p)$:

\[
\frac{dq_i}{dt} = \frac{\partial H}{\partial p_i}, \qquad
\frac{dp_i}{dt} = -\frac{\partial H}{\partial q_i}
\]

for $i = 1, 2, \ldots, n$. Here $H(q,p)$ is the Hamiltonian function of the
system, which is assumed to be a continuously differentiable function on
state space representing the total energy of the system. The function $H$ is
an example of an observable, as are the position and momentum coordinates,
angular momenta $q_i p_j - q_j p_i$, et cetera. A differentiable observable
$f$ evolves according to the equation:

\[
\frac{df}{dt} = \sum_{i=1}^{n} \left( \frac{\partial f}{\partial q_i} \frac{\partial H}{\partial p_i}
- \frac{\partial f}{\partial p_i} \frac{\partial H}{\partial q_i} \right),
\]

the right-hand side being the definition of the Poisson bracket $[f, H]$,
under which operation $C^\infty$ observables form a Lie algebra.
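
The evolution equation for observables can be checked on a small example.
The sketch below assumes a one-dimensional harmonic oscillator with
$H(q,p) = (q^2 + p^2)/2$, whose flow $T_t$ is an explicit rotation of phase
space; the observable $f(q,p) = qp$, the step size, and all function names
are illustrative choices. It compares a finite-difference time derivative of
$f \circ T_t$ with a finite-difference Poisson bracket $[f, H]$.

```python
import math


def flow(t, q, p):
    """T_t for H = (q^2 + p^2)/2: the exact solution of Hamilton's equations."""
    return (q * math.cos(t) + p * math.sin(t),
            p * math.cos(t) - q * math.sin(t))


def hamiltonian(q, p):
    return 0.5 * (q * q + p * p)


def poisson_bracket(f, g, q, p, h=1e-5):
    """[f, g] = df/dq * dg/dp - df/dp * dg/dq (n = 1), by central differences."""
    df_dq = (f(q + h, p) - f(q - h, p)) / (2 * h)
    df_dp = (f(q, p + h) - f(q, p - h)) / (2 * h)
    dg_dq = (g(q + h, p) - g(q - h, p)) / (2 * h)
    dg_dp = (g(q, p + h) - g(q, p - h)) / (2 * h)
    return df_dq * dg_dp - df_dp * dg_dq


def time_derivative(f, q, p, t=0.0, h=1e-5):
    """d/dt of f(T_t(q, p)), by a central difference in t."""
    return (f(*flow(t + h, q, p)) - f(*flow(t - h, q, p))) / (2 * h)


if __name__ == "__main__":
    f = lambda q, p: q * p  # hypothetical observable chosen for the demo
    q0, p0 = 0.3, 1.2
    print("d/dt f(T_t s) at t=0:", time_derivative(f, q0, p0))
    print("[f, H] at s:         ", poisson_bracket(f, hamiltonian, q0, p0))
    print("H before and after:  ", hamiltonian(q0, p0), hamiltonian(*flow(2.0, q0, p0)))
```

For this choice $[f, H] = p^2 - q^2$, and $[H, H] = 0$ reflects conservation
of energy along trajectories.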

Given a deterministic state space it is natural to pass to a statistical
setting as follows. We replace the old states $s$ by new states that are
(Borel) probability measures $P$ on the state space $S$. The old observables
$f$ on the original state space are replaced by new observables that are the
means of the old observables with respect to the probability measure $P$.
Thus for any original state space observable $f$, the map

\[
P \mapsto E_P(f)
\]

defines an observable on the set of probability measures. If $f$ is bounded,
this observable is defined for all probability measures. If not, it is
defined for those measures with respect to which $f$ is integrable.

If the only observables allowed were obtained in this manner, this would
appear to be a severe limitation. However, the variance of $f$ and all
moments of $f$ can themselves be regarded as means of original observables.
Indeed, even the probability distribution for $f$ can be regarded as a mean.
This is because, for a Borel subset $A$ of the reals, the probability that
$f$ takes a value in $A$ is given by $E_P(\chi_A \circ f)$, where $\chi_A$ is
the characteristic function of the set $A$.

The evolution of the probabilistic state can be induced by an underlying
deterministic evolution. The probability measure at time $t$, $P_t$, is
given by $P_t(A) = P(T_{-t}(A))$ where $A$ is any (Borel) subset of $S$.
This permits us to talk about the evolution of observables since the mapping
$t \mapsto E_{P_t}(f)$ describes such an evolution. In the Hamiltonian
formalism phase space has a natural $2n$-dimensional Lebesgue measure
$\lambda$ called Liouville measure with infinitesimal volume element
$dq_1 \ldots dq_n\, dp_1 \ldots dp_n$, and $\lambda(T_t(A)) = \lambda(A)$ for
all Borel subsets $A$ of $S$ and all times $t$. Dynamics in phase space can
be thought of as a fluid flow that permits change of shape but no change in
volume. The probability state $P$ can often be taken to be the integral of a
probability density function $\rho(q,p)$ with respect to $\lambda$. At the
other extreme $P$ can be taken to be a delta function
$\delta(q - q_0)\,\delta(p - p_0)$, which reduces to the deterministic theory
with state $s = (q_0, p_0)$. The probabilistic setting also permits us to
abandon the deterministic evolution $\{T_t\}$ and work with a stochastic
evolution exclusively, e.g., one of Markov type.

A use of means in state space theory that we have not touched on here 
relates to ergodic theory, in which time averages of observables over trajec- 
tories are compared with averages over state space regions using a suitable 
normalized volume measure. 

The essential point is that means provide the transition from classical 
observables for deterministic systems to statistical observables for stochastic 
systems. 

6. Entropy. A subtlety occurs in statistical mechanics that is not present
in ordinary probability theory. An observable is commonly defined as a
real-valued function of the state, and in statistical mechanics the state is
a probability measure $P$ on state space. Thus any real-valued function of
$P$ can be taken to be an observable, e.g., $P \mapsto P(B)$ is an observable
where $B$ is any fixed Borel set in the state space $S$. This observable is
an expected value since $P(B) = E_P(\chi_B)$. However, not all observables
arise as expected values of original observables. The most familiar example
of such an observable is the entropy function, which can be interpreted as
an expected value (mean) but is not a conventional mean.

To avoid certain difficulties associated with the continuous case we will
confine our attention to the case where the underlying state space is a
finite set. Let $S$ be a finite state space consisting of $n$ states. A
classical observable is a function $f : S \to (-\infty, \infty)$. A discrete
classical evolution might be a function $T : S \to S$ such that if $i$ is the
state at a given time then $T(i)$ is the state one time unit later. (A
continuum of times presents a problem for deterministic evolution in a
finite state space, although that problem does not arise in the
probabilistic setting.)

When we pass to a statistical notion of state, we arrive at a probability
vector $p = (p_1, \ldots, p_n)$ where $p_i$ is the probability that the
system is in the deterministic state $i$. We can now form expected values of
classical observables $f$, i.e.,

\[
E(f) = \sum_{i=1}^{n} p_i f(i)
\]

as noted before. We can also form such expressions as the entropy:

\[
H(p) = \sum_{i=1}^{n} p_i \log(1/p_i).
\]

Superficially the entropy appears to be another mean value, the mean value
of the "uncertainty" $\log(1/p_i)$, also called the "surprise value". (The
log here is usually taken with base 2.) Thus the entropy of a probability
state is the mean uncertainty of the state. This is not the mean of a
classical observable since the function $i \mapsto \log(1/p_i)$ is not a
classical observable. Classical observables should exist and be measurable
prior to assignment of probabilities, but it makes no sense to consider the
uncertainty function until probabilities have been introduced. The
dependency of the uncertainty function on $i$ is not intrinsic and is only
determined through the postulated probability state $p$.
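
The distinction can be made concrete on a small finite state space. In the
sketch below (all names and the sample numbers are illustrative
assumptions), a classical observable is a fixed list of values indexed by
state, whereas the surprise function has to be built from the probability
vector itself before its mean, the entropy, can be taken. A one-step
deterministic evolution `T` is represented as a list with `T[i]` the
successor of state `i`.

```python
import math


def expected_value(p, f):
    """E(f) = sum_i p_i f(i) for a probability vector p and observable f."""
    return sum(pi * fi for pi, fi in zip(p, f))


def surprise(p):
    """log2(1/p_i) for each state; defined only once p has been chosen."""
    return [math.log2(1.0 / pi) if pi > 0.0 else 0.0 for pi in p]


def entropy(p):
    """H(p) in bits: the mean of the surprise function under p itself."""
    return expected_value(p, surprise(p))


def push_forward(p, T):
    """Probability vector one time unit later when state i moves to T[i]."""
    q = [0.0] * len(p)
    for i, pi in enumerate(p):
        q[T[i]] += pi
    return q


if __name__ == "__main__":
    p = [0.5, 0.25, 0.125, 0.125]
    f = [1.0, 2.0, 3.0, 4.0]      # a classical observable, fixed before p
    T = [1, 2, 3, 0]              # a cyclic (bijective) evolution
    print("E(f) =", expected_value(p, f))
    print("H(p) =", entropy(p))
    print("H after one step =", entropy(push_forward(p, T)))
```

When `T` is a bijection the entropy is unchanged by the evolution, while the
mean of a classical observable generally changes.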

It happens that entropy has another relationship to means of considerable
importance, namely through the Maximum Entropy Principle (MEP), also known
as Jaynes' Principle. In the absence of an evolutionary law $T$ and an
initial assignment, we are faced with the problem of determining the
probability state $p$, i.e., an assignment of probabilities to the
deterministic states $i$. The MEP [5, p. 370] asserts that:

The probability state $p$ maximizing entropy subject to the given values
$a_1, \ldots, a_k$ for the means of known classical observables
$g_1, \ldots, g_k$ provides predictions "most strongly indicated by our
present information."

Using the calculus of variations, we can in general determine a unique
distribution among those that satisfy the constraints

\[
\sum_{i=1}^{n} p_i\, g_j(i) = a_j
\]

for $j = 1, \ldots, k$ and maximizing $H(p)$, namely, the one with the
probability assignment

\[
p_i = C \exp\Big( -\sum_{j=1}^{k} \beta_j\, g_j(i) \Big)
\]

for $i = 1, \ldots, n$, where $\beta_1, \ldots, \beta_k$ are constants
determined from the $a_j$'s, and $C$ is a positive normalizing constant
chosen so that the sum of the $p_i$'s is 1.

The interpretation of this result takes two forms (at least). Suppose the
states are those of an individual particle in a gas of $N$ particles. Then
the quantities $a_1, \ldots, a_k$ represent measured values of the total
value of $g_1, \ldots, g_k$ over the entire gas divided by $N$. The
probabilities $p_i$, derived from the MEP, are the probabilities that a
particle picked at random from among the $N$ particles is in the $i$-th
state. They may also be regarded as the fraction of particles that are in
the $i$-th state. We may not care about individual particles but we do care
about these fractions, which can be taken to define the macroscopic state of
the gas (volume, pressure, temperature, and the like). This is the ensemble
viewpoint of Gibbs. Yet another perspective is to regard what we usually
observe as a small perturbation about values induced by means.
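
For a single constraint ($k = 1$) the exponential form of the
maximum-entropy assignment can be computed directly. The sketch below is one
possible numerical approach, not the method of the text: it assumes one
observable $g$ with prescribed mean `target` strictly between the smallest
and largest values of $g$, and finds the multiplier $\beta$ by bisection,
using the fact that the mean of $g$ under $p_i = C \exp(-\beta g(i))$
decreases as $\beta$ increases. The function names, the bracket
$[-50, 50]$, and the die example are illustrative assumptions.

```python
import math


def gibbs(beta, g):
    """p_i = C * exp(-beta * g_i), with C fixed by normalization."""
    a = [-beta * gi for gi in g]
    shift = max(a)                      # subtract the max to avoid overflow
    w = [math.exp(ai - shift) for ai in a]
    z = sum(w)
    return [wi / z for wi in w]


def mean(p, g):
    return sum(pi * gi for pi, gi in zip(p, g))


def max_entropy(g, target, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisect on beta until the Gibbs distribution has mean `target`."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(gibbs(mid, g), g) > target:
            lo = mid                    # mean too large: increase beta
        else:
            hi = mid
        if hi - lo < tol:
            break
    beta = 0.5 * (lo + hi)
    return beta, gibbs(beta, g)


if __name__ == "__main__":
    faces = [1, 2, 3, 4, 5, 6]          # states of a die, g(i) = i
    beta, p = max_entropy(faces, 4.5)
    print("beta =", beta)
    print("p    =", [round(pi, 4) for pi in p])
    print("mean =", mean(p, faces))
```

With the unconstrained mean $3.5$ as target, the same routine returns
$\beta$ close to zero and the uniform distribution.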

7. Quantum Issues. The mean plays a pivotal role in quantum theory, even
if this role has not been examined closely in most treatments of quantum
theory. In quantum mechanics the state of a physical system is described by
a wave function $\psi$ that is an element of a Hilbert space $\mathcal{H}$.
(Strictly speaking $\psi$ is not a function but an equivalence class of
functions, and in addition each state is associated with a ray in Hilbert
space.) Each physical observable that takes on real-number values (e.g., a
position coordinate, a momentum coordinate, the energy, a spin component) is
associated with a self-adjoint operator $A$ in $\mathcal{H}$. For simplicity
each observable is denoted by the same symbol "$A$" as the associated
operator. Any self-adjoint operator $A$ has in turn an associated
projection-valued measure $P_A$ (see, for example, [10]) that assigns to
each Borel set $S$ in $\mathbb{R}$ an orthogonal projection $P_A(S)$ in the
Hilbert space:

\[
S \mapsto P_A(S)
\]

in such a way that $A$ is an integral combination of these orthogonal
projections, represented symbolically by:

\[
A = \int_{\mathbb{R}} x\, P_A(dx),
\]

or by:

\[
A(\psi) = \int_{\mathbb{R}} x\, P_A(dx)(\psi),
\]

where $\psi$ is in the domain of $A$. If a measurement is made, the
probability that the value of $A$ is in the set $S$ when the system state is
$\psi$ is defined to be:

\[
\langle P_A(S)(\psi), \psi \rangle = \| P_A(S)(\psi) \|^2,
\]

where $\langle \cdot\,, \cdot \rangle$ is the inner product on $\mathcal{H}$,
linear in the first variable and conjugate-linear in the second variable,
and $\| \cdot \|$ is the norm on $\mathcal{H}$.

Quantum Mechanics is thus a statistical theory based on a family of
probability measures defined by:

\[
S \mapsto \| P_A(S)(\psi) \|^2.
\]

These are the Borel probability measures associated with observables $A$
when the system state is $\psi$. One consequence of this is that the set of
possible values of $A$ is the spectrum of the operator $A$, and another is
that the mean of $A$, when the state is $\psi$, is given by:

\[
\langle A(\psi), \psi \rangle = \int_{\mathbb{R}} x\, \langle P_A(dx)(\psi), \psi \rangle.
\]

imsart-aop ver. 2011/11/15 file: Fin2paper.tex date: October 16, 2012

In particular the quantity $\langle P_A(S)(\psi), \psi \rangle = \| P_A(S)(\psi) \|^2$
can be interpreted as the mean of the observable $P_A(S)$ when the state is
$\psi$. The observable $P_A(S)$ is an orthogonal projection, taking the
value 1 when the value of $A$ is in $S$ and the value 0 when the value of
$A$ is not in $S$. Thus $\| P_A(S)(\psi) \|^2$ also represents the
probability that $A$ is in $S$ when the state is $\psi$. This is a reminder
that all probabilities are means.
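
In finite dimensions these statements reduce to elementary sums. The sketch
below assumes a finite-dimensional Hilbert space and a unit vector $\psi$
already expressed in an orthonormal eigenbasis of $A$, so that $P_A(S)$
simply keeps the coordinates whose eigenvalue lies in $S$; the function
names and the sample numbers are illustrative assumptions.

```python
def probability(eigenvalues, psi, in_S):
    """||P_A(S) psi||^2 for a unit vector psi given in an eigenbasis of A.

    in_S is a predicate deciding whether an eigenvalue belongs to S.
    """
    return sum(abs(c) ** 2 for a, c in zip(eigenvalues, psi) if in_S(a))


def mean(eigenvalues, psi):
    """<A psi, psi> = sum_k a_k |psi_k|^2 in an eigenbasis of A."""
    return sum(a * abs(c) ** 2 for a, c in zip(eigenvalues, psi))


if __name__ == "__main__":
    eigenvalues = [-1.0, 0.0, 2.0]
    psi = [0.6, 0.0, 0.8j]                    # unit vector: 0.36 + 0.64 = 1
    positive = lambda a: a > 0.0              # the Borel set S = (0, infinity)

    # The projection P_A(S) is itself an observable with eigenvalues 0 or 1.
    projection_eigenvalues = [1.0 if positive(a) else 0.0 for a in eigenvalues]

    print("mean of A:            ", mean(eigenvalues, psi))
    print("P(A in S):            ", probability(eigenvalues, psi, positive))
    print("mean of P_A(S):       ", mean(projection_eigenvalues, psi))
```

The last two printed numbers agree, which is the finite-dimensional form of
the remark that probabilities are means.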

The mean, $\langle A(\psi), \psi \rangle$, is the integral over the real
line of the real variable $x$ with respect to the Borel probability measure
$\| P_A(\cdot)(\psi) \|^2$. Thus the mean exists, it appears, if and only if
$x$ is integrable with respect to this measure, thus if and only if $\psi$
is in the domain of $A$. Self-adjoint operators have domains that are dense
in $\mathcal{H}$ but many of the most prominent ones (e.g., those associated
with position and momentum and often energy) do not have domain equal to
$\mathcal{H}$. Hence there will be states for which the means of some
observables may not be well-defined. Whether these states are realizable in
practice is uncertain, but there is no good theoretical reason why they
should be ignored. (Our discussion focuses on mathematical definition and
characterization. The spectrum of a self-adjoint operator is identified with
the possible values of a measured quantity. If the spectrum is discrete, a
measurement may be able to distinguish one value from another; if the
spectrum is continuous, measurement will only be able to determine an
interval that contains the value, not the exact value. Repeated measurements
when the system is in the same state thus only arrive at a rough
approximation of the distribution and a rough estimate of the mean for a
state.)

A curiosity in quantum mechanics, not ordinarily seen in other applications
of probability, is the following. Suppose $\mu = \langle A(\psi), \psi \rangle$
is the mean of some observable $A$ when the state is $\psi$. Then the
variance of the observable in this state is naturally given by:

\[
\int_{\mathbb{R}} (x - \mu)^2\, \langle P_A(dx)(\psi), \psi \rangle
= \langle (A - \mu I)^2(\psi), \psi \rangle
= \| (A - \mu I)(\psi) \|^2.
\]

So the variance exists if and only if $\psi$ is in the domain of $A$. The
condition for the mean to exist is the same as the condition for the
variance to exist. In quantum mechanics we are led to think that the only
distributions for which the mean is finite are ones in which the variance is
also finite. However, a closer look at this situation reveals some
discrepancies.

The chief discrepancy is the following. Suppose that the original
observable $A$ can be written in the form

\[
A = C - D
\]

where $C$ and $D$ are non-negative self-adjoint operators. Non-negative
self-adjoint operators can be written as squares of self-adjoint operators,
so that $C = E^2$ and $D = F^2$ with $E$ and $F$ self-adjoint. Then

\[
\langle A(\psi), \psi \rangle
= \langle E^2(\psi), \psi \rangle - \langle F^2(\psi), \psi \rangle
= \| E(\psi) \|^2 - \| F(\psi) \|^2.
\]

Thus, the mean of $A$ exists if and only if $\psi$ is in the intersection of
the domain of $E$ and the domain of $F$. It is easy, incidentally, to
construct examples of elements of $\mathcal{H}$ that are in the domain of a
self-adjoint operator $E$ but are not in the domain of its square $E^2$. In
addition, as it happens, it is possible to offer explicit candidates for the
operators $E$ and $F$ given $A$. Set

\[
E = \int_{0}^{\infty} \sqrt{x}\; P_A(dx), \qquad
F = \int_{-\infty}^{0} \sqrt{-x}\; P_A(dx).
\]

We conclude that the mean of $A$ exists when $\psi$ is in
$\operatorname{dom} E \cap \operatorname{dom} F$ and the variance of $A$
exists when $\psi$ is in $\operatorname{dom} A$. If $\psi$ is not in
$\operatorname{dom} E$ but is in $\operatorname{dom} F$, then it is
reasonable to say that the mean of $A$ is $\infty$. Likewise if $\psi$ is in
$\operatorname{dom} E$ but not in $\operatorname{dom} F$, the mean of $A$ is
$-\infty$. If $\psi$ is in neither $\operatorname{dom} E$ nor
$\operatorname{dom} F$, then the ordinary mean does not exist. In the spirit
of our discussion of $L(c)$ earlier, it is possible to truncate the
integrals for $E$ and $F$ in the last display, replacing $\infty$ by $M$ and
$-\infty$ by $-M$, and investigate the existence of an appropriate combined
limit as $M$ tends to infinity.
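
A concrete instance of a state with a finite mean but no finite variance can
be sketched as follows. The example is an illustrative assumption, not taken
from the text: $A$ acts on square-summable sequences by $A e_n = n e_n$ for
$n = 1, 2, \ldots$, so $A$ is non-negative, $F = 0$, and $E = A^{1/2}$. For
$\psi_n$ proportional to $n^{-s}$, $\psi$ lies in $\operatorname{dom} E$
exactly when $s > 1$ and in $\operatorname{dom} A$ exactly when $s > 3/2$.
The code computes truncated sums over $n \le N$, mirroring the truncation at
$M$ described above.

```python
def partial_moments(s: float, N: int) -> tuple[float, float]:
    """Truncated mean and second moment of A in the state psi_n ~ n**(-s).

    Sums run over n <= N and are divided by the truncated squared norm:
      norm   = sum n**(-2s)
      mean   = sum n * n**(-2s)      (finite limit iff s > 1)
      second = sum n**2 * n**(-2s)   (finite limit iff s > 3/2)
    """
    norm = first = second = 0.0
    for n in range(1, N + 1):
        w = n ** (-2.0 * s)
        norm += w
        first += n * w
        second += n * n * w
    return first / norm, second / norm


if __name__ == "__main__":
    s = 1.25  # psi is in dom E but not in dom A
    for N in (10**2, 10**3, 10**4, 10**5):
        mean, second = partial_moments(s, N)
        print(N, round(mean, 4), round(second, 2))
```

For $s = 1.25$ the truncated means stabilize while the truncated second
moments keep growing, roughly like $\sqrt{N}$.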

Similar considerations can be applied when the "pure" state $\psi$ is
replaced by a density matrix representing a statistical ensemble of pure
states, or in rigged Hilbert spaces where the existence of states varies
according to the properties of the observable, or to cases arising by the
use of positive-operator-valued measures generalizing the projection-valued
measures treated above.

8. Conclusion. The mean, as we have seen, is ubiquitous in scientific
explanation. Not only does it provide a summary of sample data and, when it
exists, of data from the entire population, but it establishes a connection
between samples and the whole population. Furthermore, it facilitates
generalization of deterministic observables that are functions of the
deterministic state to probabilistic observables that are functions of the
probabilistic state. The Maximum Entropy Principle then makes use of
constrained means to identify macroscopic distributions of physical
importance parametrized by these means. While quantum mechanics abandons
determinism, it retains the notion of mean to summarize the possible results
of experiments and the measurement of quantum observables. Although not all
observables have finite means, weak and doubly weak means and the
alternatives identified in Theorem 3.1 provide an enumeration of possible
behaviors of variables and associated probability distributions, and give
further insight into potentialities associated with large data sets.